refactor(integration_tests): unify bridge models and clear dbt deprecations #1325

Open
rabee05 wants to merge 9 commits into tuva-health:main from rabee05:main

rabee05 commented Apr 24, 2026

Problem

  1. Every bridge model had two SELECTs gated by use_synthetic_data, with duplicated column lists that drifted apart.
  2. No DAG edge from raw_data__* seeds to bridge models, so dbt build sometimes ran models before seeds and failed with Dataset raw_data not found.
  3. Mixed model layouts (inline vs tuva_columns/tuva_extensions/tuva_metadata) and ~500 copies of the same cross-adapter type jinja across seed yml files.
  4. dbt parse surfaced three deprecations and one BigQuery error (cast(… as varchar)).

Fix

  • tuva_source() macro — returns ref('raw_data__<table>') in synthetic mode, source('source_input', <table>) otherwise. Compile-time ref() registers the seed → model DAG edge.
  • Unified 15 bridge models on the tuva_columns / tuva_extensions / tuva_metadata layout.
  • Restored models/_sources.yml with optional input_database / input_schema (fall back to target.database / target.schema).
  • YAML anchors (*string, *datetime, *float) in both seed yml files.
  • Gated raw_data__* and synthetic_data__* seeds on use_synthetic_data.
  • Seed naming: patient_seed.csv → raw_data__patient.csv; added header-only CSVs for 6 missing clinical tables.
  • Cleared deprecations: +batch_size → +meta.batch_size; combination_of_columns nested under arguments:; varchar → {{ dbt.type_string() }} in extension tests.

How to test

  • Run dbt build --full-refresh with use_synthetic_data: true — seeds run before bridge models, extension-column tests pass on BigQuery.
  • Run dbt build --full-refresh with use_synthetic_data: false and input_database / input_schema pointed at a real input layer — bridge models read from source_input, no synthetic seeds materialize.

Breaking changes

None. Tuva package contract unchanged.


Author: SnowQuery — Healthcare Data Engineering & Architecture Consulting

rabee05 added 8 commits April 24, 2026 03:00
…thetic mode

* Rename patient_seed.csv to raw_data__patient.csv so every clinical
  input seed follows the raw_data__<table> convention that
  tuva_source() resolves to.
* Add empty header-only CSVs for condition, encounter, location,
  medication, practitioner, procedure — gives each clinical table a
  seed relation in synthetic mode even when no synthetic data exists.
* Replace repeated per-column cross-adapter jinja with YAML anchors
  (*string, *datetime, *float) in seeds/_seeds.yml, dropping ~640
  lines of duplication.
* Gate every raw_data__* seed on use_synthetic_data so non-synthetic
  runs no longer materialize unused seed tables.
…sion-column checks

BigQuery rejects "cast(... as varchar)"; use the cross-adapter type
macro so check_extension_columns_in_core_{eligibility,medical_claim,
member_months,pharmacy_claim} tests compile on every warehouse.
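
As a rough sketch of this change, a cast written with the cross-adapter macro looks like the following. The names x_example_column and some_core_model are hypothetical placeholders, not identifiers from the actual tests:

```sql
-- Before (rejected by BigQuery): cast(x_example_column as varchar)
-- After: dbt.type_string() renders the adapter's own string type,
-- for example "string" on BigQuery.
-- x_example_column and some_core_model are hypothetical names.
select
    cast(x_example_column as {{ dbt.type_string() }}) as x_example_column
from {{ ref('some_core_model') }}
```
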
Collapse the repeated per-column cross-adapter jinja
(bigquery/databricks string, athena/databricks real, fabric
datetime2, etc.) into three shared anchors — *string, *datetime,
*float — and reference them from every column_types entry.

Also gate each synthetic_data__* seed on use_synthetic_data so
non-synthetic runs stop materializing seeds nothing refs.

Drops ~260 lines of repetition from seeds/synthetic_data/synthetic_data_seeds.yml
without changing runtime type resolution.
… at compile time

Returns ref('raw_data__<table>') when use_synthetic_data is true, or
source('source_input', <table>) otherwise. Because ref() is evaluated
at parse time, dbt wires the seed as an upstream dependency of every
bridge model that calls tuva_source(), so seeds run before their
dependent models without an on-run-start hook or explicit ordering.

Returns the Relation object (not a rendered string), so callers can
use either "from {{ tuva_source('X') }}" or bind it with {% set r =
tuva_source('X') %} for adapter.get_columns_in_relation(r).
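
A minimal sketch of what such a macro could look like, assuming use_synthetic_data is read through var() and arrives as a boolean. The parameter name table_name and the default of false are illustrative assumptions, not taken from the PR:

```sql
{# Sketch only: table_name and the var() default are assumptions. #}
{% macro tuva_source(table_name) %}
    {% if var('use_synthetic_data', false) %}
        {# ref() is called while dbt parses the calling model, so the
           seed is registered as an upstream node of that model. #}
        {% do return(ref('raw_data__' ~ table_name)) %}
    {% else %}
        {# Otherwise read from the input layer declared as source_input. #}
        {% do return(source('source_input', table_name)) %}
    {% endif %}
{% endmacro %}
```

Using return() hands back the relation itself rather than rendered text, which is what lets the same macro serve both a FROM clause and a {% set %} binding.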
… into single SELECT via tuva_source()

Every bridge model had two duplicated SELECTs gated by
use_synthetic_data. Replaced with one SELECT reading from
tuva_source('<table>') — the macro swaps ref() vs source() at
compile time so the toggle is invisible to the model.

With _sources.yml restored, setting input_database + input_schema
now points the same models at the user's own input layer when
use_synthetic_data is false. No model edits needed.
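
For illustration, a unified bridge model might then reduce to something like the following. The column names are placeholders, not the actual patient column list from the PR:

```sql
-- Hypothetical bridge model for the patient table.
-- Column names are placeholders, not the PR's column list.
select
    person_id
  , birth_date
  , data_source
from {{ tuva_source('patient') }}  -- seed or source, chosen at compile time
```
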
…eprecation

dbt now flags custom top-level config keys; nest the seed
batch_size hint under +meta so the deprecation warning goes away.
dbt 1.10+ requires generic-test arguments to be nested under
arguments (MissingArgumentsPropertyInGenericTestDeprecation).
Updated the hcc_recapture staging/intermediate/final yml tests.

netlify Bot commented Apr 24, 2026

Deploy Preview for thetuvaproject canceled.

🔨 Latest commit: a4902f8
🔍 Latest deploy log: https://app.netlify.com/projects/thetuvaproject/deploys/69eaff9c8255360008579aed

Projects

Status: 👀 Ready for Review
